This invention relates generally to labeling data and, more particularly, to methods and systems for classifying data based on a hierarchical taxonomy of clustered data.
Automated classification, or “labeling,” of data may be used to efficiently organize, route, and/or process such data. As an example, support centers receive large numbers of documents related to support requests. Document labeling techniques, such as clustering, may be used to group together similar documents. For example, when a support center receives a set of documents, the support center may execute clustering software that extracts keywords from the documents. Based on the extracted keywords, the clustering software groups together documents that are associated with similar keywords to generate clusters of documents.
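By way of a non-limiting illustration, the keyword-based clustering described above might be sketched as follows. The stop-word list, the Jaccard similarity measure, the 0.25 threshold, and the function names are all assumptions made for this sketch. The description above does not specify how keywords are extracted or compared.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a practical system would use a far larger one.
STOP_WORDS = {"the", "a", "an", "is", "my", "to", "i", "and", "of", "in", "not", "it"}


def extract_keywords(text, top_k=5):
    """Return up to top_k of the most frequent non-stop-word tokens as a set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {word for word, _ in counts.most_common(top_k)}


def jaccard(a, b):
    """Jaccard similarity of two keyword sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(documents, threshold=0.25, top_k=5):
    """Greedy single-pass clustering on keyword overlap.

    Each document joins the existing cluster with the highest keyword
    similarity if that similarity reaches the threshold; otherwise it
    starts a new cluster.
    """
    clusters = []  # each cluster: {"keywords": set, "members": [doc indices]}
    for index, text in enumerate(documents):
        keywords = extract_keywords(text, top_k)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = jaccard(keywords, cluster["keywords"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["members"].append(index)
            best["keywords"] |= keywords
        else:
            clusters.append({"keywords": set(keywords), "members": [index]})
    return clusters


if __name__ == "__main__":
    docs = [
        "Password reset link not working",
        "Cannot reset my password",
        "Printer driver install fails",
        "Printer driver crashes",
    ]
    for cluster in cluster_documents(docs):
        print(sorted(cluster["keywords"]), cluster["members"])
```

In this sketch, the two password-related requests fall into one cluster and the two printer-related requests into another, because each pair shares two keywords. Other similarity measures or clustering algorithms could be substituted without changing the overall idea.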
To label the various clusters, a support center may employ individuals to manually review clusters of documents and to determine an appropriate label for the clusters. Rather than reviewing all the documents in a cluster, an individual may review (e.g., read) a sample of documents. Based on the reviewed sample of documents, the individual may label the cluster.
Such techniques may provide a static labeling scheme in which a new incoming document is labeled based on the label previously assigned to a cluster of documents similar to the incoming document. However, topics represented by incoming documents may change over time, such that the creation of new labels is appropriate. In addition, the meaning of terms may change over time, such that a term previously associated with one topic may come to be associated with a new topic. A static labeling scheme may not adequately accommodate such changes.
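The limitation of a static labeling scheme may be illustrated with a minimal sketch, again offered only as an assumption-laden example. The labels, the keyword sets, and the `label_incoming` function are hypothetical. The table of labeled clusters is fixed in advance, and an incoming document can only receive one of the labels already in that table.

```python
import re

# Hypothetical, previously labeled clusters represented by keyword sets.
# In a static scheme this table is never updated after manual labeling.
LABELED_CLUSTERS = {
    "password reset": {"password", "reset", "login", "account"},
    "printer issue": {"printer", "driver", "paper", "print"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def label_incoming(text, labeled_clusters=LABELED_CLUSTERS):
    """Return (label, score) for the most similar existing cluster.

    Similarity is the Jaccard overlap between the document's tokens and
    the cluster's keywords. If nothing overlaps, the label is None,
    because a static scheme cannot create a new label.
    """
    words = tokenize(text)
    best_label, best_score = None, 0.0
    for label, keywords in labeled_clusters.items():
        union = words | keywords
        score = len(words & keywords) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


if __name__ == "__main__":
    print(label_incoming("I forgot my password and need a reset"))
    print(label_incoming("My VPN tunnel keeps dropping"))
```

The first request matches the existing "password reset" cluster. The second concerns a topic that did not exist when the clusters were labeled, so the sketch returns no label at all. This is the kind of change over time that a static scheme does not accommodate.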
Similar techniques may also be applied to documents other than support requests. For example, some organizations may use these techniques to organize a variety of documents (e.g., text files, emails, images, metadata files, audio files, presentations, etc.) accessible over the Internet. A static labeling scheme has similar limitations when applied to these types of documents.